Text Mining Task
In this analysis, we examine tweet engagement patterns for Bern University of Applied Sciences (BFH) compared to other Swiss higher education institutions. We use visual and textual data to identify key engagement metrics and content strategies that could improve BFH's social media performance.
Data Import & Analysis
In this section, we load and explore the data in order to decide how to proceed.
Data Import
We import the dataset Tweets_all.rda and work on a copy named tweets, so that we cannot accidentally change the original data. We use several libraries to process the tasks and produce the requested output.
# Load data set and make a copy of the original
set.seed(123)
options(scipen=999)
load("Tweets_all.rda") # loads the object 'Tweets_all'; note that load() returns only the object's name
tweets <- Tweets_all # work on a copy so the original stays untouched
Data Exploration
This section gives a concise overview of the tweets from the Swiss university social media accounts. The dataset consists of 19,575 observations and 14 variables:
Time Range and Tweet Frequency:
- The tweets range from September 29, 2009 to January 26, 2023, indicating long-term use of Twitter
- The median tweet date is April 13, 2018, meaning half of the tweets were posted after this date; since the mean date (December 9, 2017) lies before the median, the distribution is skewed towards the later years
Retweet and Favorite Counts:
- The data shows a minimum of 0 and a maximum of 267 retweets and 188 likes per tweet
- the median and first quartile for retweets and likes are 0, indicating that many tweets receive little to no engagement
- The in_reply_to_screen_name field suggests that some tweets are responses to other users, which might indicate engagement or conversation strategies used by the universities
ID and String Variables:
- The id and id_str fields are technical tweet identifiers; their wide numeric range reflects that tweets were collected over a long time span
Language and University Fields:
- lang shows the language of the tweet; university shows the abbreviation of the university
Temporal Patterns:
created_at, tweet_date, tweet_hour, and tweet_month provide detailed temporal data that can be analyzed to understand peak activity times and seasonal or monthly trends in tweeting behavior.
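The derived fields tweet_date, tweet_hour, and tweet_month are not defined in this report; a minimal base-R sketch of how such fields could be computed from created_at (an assumption — the dataset's actual preprocessing may differ, e.g. it stores tweet_hour as a truncated timestamp rather than a plain hour):

```r
# Hypothetical derivation of coarser time fields from a POSIXct timestamp
created_at <- as.POSIXct("2018-04-13 13:26:56", tz = "UTC")

tweet_date  <- as.Date(created_at)                      # calendar day
tweet_hour  <- format(created_at, "%H")                 # hour of day as "00".."23"
tweet_month <- as.Date(format(created_at, "%Y-%m-01"))  # first day of the month

c(date = format(tweet_date), hour = tweet_hour, month = format(tweet_month))
```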
Content Analysis
The word cloud represents the most frequently used words in the filtered tweets with high engagement (likes or retweets). Key observations include:
Frequent Terms: Larger words such as "bachelor", "design", "die", "das", "der", and "amp" indicate their higher occurrence. Key Topics: "bachelor" for Bachelor's programs or graduates; "design" related to design courses or projects; "HSLU" (Hochschule Luzern). General terms: "schweiz", "zeigen", "nicht". Note: The term "amp" appears due to HTML encoding of "&" and is not meaningful.
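The "amp" artifact stems from HTML-encoded ampersands ("&amp;") in the raw tweet text. A small sketch of decoding common entities before tokenization (not part of the original pipeline; the helper name is illustrative):

```r
# Illustrative helper: decode common HTML entities so tokens like "amp"
# do not end up in the vocabulary (helper name is hypothetical)
decode_entities <- function(x) {
  x <- gsub("&amp;",  "&",  x, fixed = TRUE)
  x <- gsub("&lt;",   "<",  x, fixed = TRUE)
  x <- gsub("&gt;",   ">",  x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  x
}

decode_entities("Forschung &amp; Lehre")  # "Forschung & Lehre"
```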
## # A tibble: 6 × 14
## created_at id id_str full_text in_reply_to_screen_n…¹
## <dttm> <dbl> <chr> <chr> <chr>
## 1 2023-01-20 17:17:32 1.62e18 1616469988369469… "Im MSc … <NA>
## 2 2023-01-13 07:52:01 1.61e18 1613790954737074… "Was bew… <NA>
## 3 2023-01-12 19:30:01 1.61e18 1613604227141537… "Was uns… <NA>
## 4 2023-01-12 08:23:00 1.61e18 1613436367169634… "Eine di… <NA>
## 5 2023-01-11 14:00:05 1.61e18 1613158809081450… "Wir gra… <NA>
## 6 2023-01-10 17:06:11 1.61e18 1612843252083834… "Unsere … <NA>
## # ℹ abbreviated name: ¹in_reply_to_screen_name
## # ℹ 9 more variables: retweet_count <int>, favorite_count <int>, lang <chr>,
## # university <chr>, tweet_date <dttm>, tweet_minute <dttm>,
## # tweet_hour <dttm>, tweet_month <date>, timeofday_hour <chr>
## created_at id
## Min. :2009-09-29 14:29:47.0 Min. : 4468752018
## 1st Qu.:2015-01-28 15:07:41.5 1st Qu.: 560439073866000000
## Median :2018-04-13 13:26:56.0 Median : 984754806702000000
## Mean :2017-12-09 15:26:50.7 Mean : 939953703992000000
## 3rd Qu.:2020-10-20 10:34:50.0 3rd Qu.:1318470720360000000
## Max. :2023-01-26 14:49:31.0 Max. :1618607065240000000
## id_str full_text in_reply_to_screen_name
## Length:19575 Length:19575 Length:19575
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## retweet_count favorite_count lang university
## Min. : 0.000 Min. : 0.00 Length:19575 Length:19575
## 1st Qu.: 0.000 1st Qu.: 0.00 Class :character Class :character
## Median : 1.000 Median : 0.00 Mode :character Mode :character
## Mean : 1.289 Mean : 1.37
## 3rd Qu.: 2.000 3rd Qu.: 2.00
## Max. :267.000 Max. :188.00
## tweet_date tweet_minute
## Min. :2009-09-29 00:00:00.00 Min. :2009-09-29 14:29:00.00
## 1st Qu.:2015-01-28 00:00:00.00 1st Qu.:2015-01-28 15:07:00.00
## Median :2018-04-13 00:00:00.00 Median :2018-04-13 13:26:00.00
## Mean :2017-12-09 02:25:45.00 Mean :2017-12-09 15:26:24.68
## 3rd Qu.:2020-10-20 00:00:00.00 3rd Qu.:2020-10-20 10:34:30.00
## Max. :2023-01-26 00:00:00.00 Max. :2023-01-26 14:49:00.00
## tweet_hour tweet_month timeofday_hour
## Min. :2009-09-29 14:00:00.00 Min. :2009-09-01 Length:19575
## 1st Qu.:2015-01-28 14:30:00.00 1st Qu.:2015-01-01 Class :character
## Median :2018-04-13 13:00:00.00 Median :2018-04-01 Mode :character
## Mean :2017-12-09 14:59:43.81 Mean :2017-11-24
## 3rd Qu.:2020-10-20 10:00:00.00 3rd Qu.:2020-10-01
## Max. :2023-01-26 14:00:00.00 Max. :2023-01-01
Data Manipulation
Next, we will prepare the data for analysis.
Languages
Here, we calculate the frequency of each language present in the tweets dataset and sort these frequencies in descending order.
The
output indicates that German (de) is the most common language with
14,474 occurrences, followed by Italian (it) with 1,865 and French (fr)
with 1,792. English (en) comes next with 1,280 tweets. The frequencies
of other languages, including rare and less commonly used ones, are also
listed, showcasing the linguistic diversity in the dataset.
# Count the frequency of each language
lang_counts <- table(tweets$lang)
# Sort the language frequencies in descending order
sort(lang_counts, decreasing = TRUE)
##
## de it fr en qam qme es ca da ro nl in et
## 14474 1865 1792 1280 35 21 19 10 10 10 9 7 6
## und pt zxx art lv cy fi lt no qht cs eu ht
## 6 4 4 3 3 2 2 2 2 2 1 1 1
## ja sv tl tr
## 1 1 1 1
Because German, Italian, French, and English are the most frequent languages, while the remaining languages appear only rarely and are not among the main languages spoken in Switzerland, we limit the dataset to these four.
# Filter the DataFrame to keep only tweets in German, Italian, French and English
filtered_tweets <- tweets[tweets$lang %in% c("de", "it", "fr", "en"), ]
# Check the resulting language distribution
table(filtered_tweets$lang)
##
## de en fr it
## 14474 1280 1792 1865
This gives us the new summary of the dataset:
- Number of Records: The total count of tweets has decreased from 19,575 to 19,411; the 164 tweets in other languages were removed by the filter.
- Date and Time: Minimal changes are reflected across the median and mean values.
- Other Attributes: No significant changes are observed in the ranges.
## created_at id
## Min. :2009-09-29 14:29:47.00 Min. : 4468752018
## 1st Qu.:2015-02-04 11:39:32.00 1st Qu.: 562923403041000000
## Median :2018-04-17 13:53:07.00 Median : 986210946744999936
## Mean :2017-12-11 15:27:49.55 Mean : 940675313339000064
## 3rd Qu.:2020-10-20 11:09:15.50 3rd Qu.:1318479385120000000
## Max. :2023-01-26 14:49:31.00 Max. :1618607065240000000
## id_str full_text in_reply_to_screen_name
## Length:19411 Length:19411 Length:19411
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## retweet_count favorite_count lang university
## Min. : 0.000 Min. : 0.000 Length:19411 Length:19411
## 1st Qu.: 0.000 1st Qu.: 0.000 Class :character Class :character
## Median : 1.000 Median : 0.000 Mode :character Mode :character
## Mean : 1.293 Mean : 1.376
## 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :267.000 Max. :188.000
## tweet_date tweet_minute
## Min. :2009-09-29 00:00:00.0 Min. :2009-09-29 14:29:00.00
## 1st Qu.:2015-02-04 00:00:00.0 1st Qu.:2015-02-04 11:39:00.00
## Median :2018-04-17 00:00:00.0 Median :2018-04-17 13:53:00.00
## Mean :2017-12-11 02:26:53.7 Mean :2017-12-11 15:27:23.56
## 3rd Qu.:2020-10-20 00:00:00.0 3rd Qu.:2020-10-20 11:09:00.00
## Max. :2023-01-26 00:00:00.0 Max. :2023-01-26 14:49:00.00
## tweet_hour tweet_month timeofday_hour
## Min. :2009-09-29 14:00:00.00 Min. :2009-09-01 Length:19411
## 1st Qu.:2015-02-04 11:30:00.00 1st Qu.:2015-02-01 Class :character
## Median :2018-04-17 13:00:00.00 Median :2018-04-01 Mode :character
## Mean :2017-12-11 15:00:42.28 Mean :2017-11-26
## 3rd Qu.:2020-10-20 10:30:00.00 3rd Qu.:2020-10-01
## Max. :2023-01-26 14:00:00.00 Max. :2023-01-01
Emojis
The package emo is used for emoji analysis in R, which
is essential for text data that includes emojis. This is useful for
cleaning data, extracting information, or preparing text for further
analysis.
Understanding the prevalence of emojis can help analyze
sentiment, user engagement, or cultural trends in social media data.
# Install the emo package from GitHub for emoji analysis
if (!require("emo")) {
remotes::install_github("hadley/emo")
}
## Lade nötiges Paket: emo
Tweet Analysis
In this section we will use the prepared data to analyze the tweets for frequency, interactions and universities.
Tweet Frequency Analysis
In this section we will analyze the tweet frequency of the Swiss universities.
Tweet Frequency over Time
Each histogram shows fluctuations in tweet volumes over the years:
- Universities like HSLU and ZHAW: Display prominent peaks at certain intervals, possibly indicating targeted social media campaigns or significant events that engaged the university community.
- Other Universities (e.g., BFH, FHNW): Some show a steady level of activity with occasional spikes, while others might exhibit a decline or increase in activity, suggesting changes in social media strategy or external factors impacting engagement.
# Code to analyze tweet frequencies by time and university
p1<- filtered_tweets %>%
mutate(tweet_month = floor_date(created_at, "month")) %>%
group_by(university, tweet_month) %>%
summarize(count = n(), .groups = 'drop') %>%
ggplot(aes(x = tweet_month, y = count, fill = university)) +
geom_col(position = "dodge") +
theme_minimal() +
labs(title = "Monthly Tweet Frequency by University", x = "Year", y = "Number of Tweets")
# Convert to interactive plotly object (default tooltips, as p1 defines no 'text' aesthetic)
interactive_plot <- ggplotly(p1)
# Optionally, add configurations to enhance interaction
interactive_plot <- interactive_plot %>% layout(
hovermode = 'closest',
title = "Click on a University to see its Tweet Trends",
showlegend = TRUE
)
interactive_plot
Tweet Frequency - Terms
Here we identify the terms that occur most frequently in the tweets.
Text Preprocessing
We create a text corpus from filtered_tweets$clean_text,
where each tweet is treated as a separate document.
The corpus
serves as the foundational structure for text analysis, allowing for
uniform processing and manipulation of the text data.
# Corpus: Collection of text documents that generally serves as a basis for analysis in text processing and text mining.
# VectorSource(filtered_tweets$clean_text): each entry in the vector becomes a separate document in the corpus.
# It is important that the text is extracted, as the corpus should only work with text data.
corpus <- Corpus(VectorSource(filtered_tweets$clean_text))
Here we clean the corpus by converting all text to lowercase,
removing punctuation, numbers, and stopwords from German, French,
Italian, and English, and finally stripping extra spaces.
Cleaning
the text is crucial for reducing noise and focusing analyses on
meaningful words only. This standardizes the text data, making
subsequent analyses like topic modeling or sentiment analysis more
effective and less prone to error due to textual inconsistencies.
# Clean text
corpus <- tm_map(corpus, content_transformer(tolower)) # Convert to lower case
corpus <- tm_map(corpus, removePunctuation) # Removing punctuation marks
corpus <- tm_map(corpus, removeNumbers) # Removing numbers
corpus <- tm_map(corpus, removeWords, stopwords("german")) # Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("french"))
corpus <- tm_map(corpus, removeWords, stopwords("italian"))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace) # Removal of additional spaces
corpus <- tm_map(corpus, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus <- tm_map(corpus, content_transformer(function(x) {
x <- gsub("’", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove the 'amp' token from HTML-encoded '&amp;' (word boundaries protect words like 'campus')
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
Here we create a Document-Term Matrix (DTM) from the corpus,
applying additional filters like punctuation removal and stopping word
exclusion during the matrix formation. Then, it filters out terms that
appear in less than 1% of the documents to reduce sparsity.
Reducing sparsity helps focus on terms that have significant presence
across documents, enhancing the reliability and performance of
statistical models and algorithms applied later.
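Concretely, sparse = 0.99 keeps only terms that appear in more than 1% of documents. A base-R sketch of that rule on toy document frequencies (illustrative only; tm applies the same idea on its sparse matrix representation, and boundary handling may differ slightly):

```r
# Toy example: 200 documents, with the number of documents containing each term
n_docs   <- 200
doc_freq <- c(innovation = 50, bachelor = 10, rareword = 1)

sparse_threshold <- 0.99
# A term survives when the share of documents containing it exceeds 1 - 0.99 = 1%
keep <- doc_freq / n_docs > (1 - sparse_threshold)
names(doc_freq)[keep]  # "innovation" "bachelor"
```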
# Create DTM and remove sparse terms
dtm1 <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm1 <- removeSparseTerms(dtm1, sparse = 0.99) # Adjust sparsity threshold as needed
Terms Analysis:
- Dominant Themes: Words like "schweizer" (Swiss), "unternehmen" (companies), "zukunft" (future), "innov" (innovation), and "digital" suggest that the text data heavily revolves around themes of Swiss companies, innovation, and digital advancements.
- Common Words: Frequent appearance of terms like "dank" (thanks), "neue" (new), "mehr" (more), and "info" indicates common communication patterns, possibly related to news dissemination or updates about new developments and initiatives.
set.seed(123)
# Compute term frequencies (columns of the DTM are terms, rows are documents)
word_freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
top_word_freq1 <- head(word_freq1, 80)
# Generate word cloud with terms paired to their own frequencies
wordcloud(
words = names(top_word_freq1),
freq = top_word_freq1,
max.words = 80,
scale = c(4, 0.5), # Control for size of the most and least frequent words
random.order = FALSE, # Higher frequency words appear first
rot.per = 0.25, # Allows some rotation for fitting
colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
Tweets Frequency - Emojis
- Engagement Strategy: The frequent use of directional emojis such as ➡️ and ⤵️ suggests that guiding readers to additional content or important links is a successful strategy.
- Content Themes: Emojis such as 💻 (laptop) and 💡 (light bulb) highlight the focus on education, research, and technology.
- Celebratory Communication: Celebratory emojis such as 🥳 (party face) signify celebration and achievement.
# Analyze the frequency of different emojis and select the top 50
emoji_freq2 <- table(unlist(filtered_tweets$emojis))
sort(emoji_freq2, decreasing = TRUE)[1:50]
##
## β‘οΈ β€΅οΈ π π π π¨π π‘ π» π π π£ π β¨ π¬ π π¬
## 414 247 180 117 97 75 67 67 65 63 57 56 45 44 38 36
## π€ π π π€ π¨ π ποΈ π π π π πΈ π πͺ β‘ π±
## 36 35 32 32 32 30 28 26 26 25 23 23 22 22 21 21
## π©βπ βΆοΈ π π βοΈ π¨βπ π π³ π₯ π₯³ πΎ π π’ π π π€
## 21 20 20 20 19 19 19 18 18 18 17 17 17 17 17 17
## π π
## 16 16
High Engagement
In this section, we want to focus on tweets that have attracted more attention and interaction.
High Engagement - Terms
Text Preprocessing:
This section sets a variable engagement_threshold to 20, the minimum number of likes or retweets a tweet must have to be considered "high engagement". This threshold helps to focus on tweets that have garnered more attention and interaction.
# Set a threshold for "high engagement" (e.g., tweets with at least 20 likes or retweets)
engagement_threshold <- 20
# Filter tweets based on this engagement threshold
high_engagement_tweets <- filtered_tweets %>%
filter(favorite_count >= engagement_threshold | retweet_count >= engagement_threshold)
Also for the high_engagement_tweets we clean the corpus
by converting all text to lowercase, removing punctuation, numbers, and
stopwords from German, French, Italian, and English, and finally
stripping extra spaces and we create a Document-Term Matrix (DTM) from
this corpus.
# Rebuild the corpus with the sampled data
corpus2 <- Corpus(VectorSource(high_engagement_tweets$clean_text))
corpus2 <- tm_map(corpus2, content_transformer(tolower)) # Convert to lower case
corpus2 <- tm_map(corpus2, removePunctuation) # Removing punctuation marks
corpus2 <- tm_map(corpus2, removeNumbers) # Removing numbers
corpus2 <- tm_map(corpus2, removeWords, stopwords("german")) # Removing stop words
corpus2 <- tm_map(corpus2, removeWords, stopwords("french"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("italian"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
corpus2 <- tm_map(corpus2, stripWhitespace) # Removal of additional spaces
corpus2 <- tm_map(corpus2, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus2 <- tm_map(corpus2, content_transformer(function(x) {
x <- gsub("’", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove the 'amp' token from HTML-encoded '&amp;' (word boundaries protect words like 'campus')
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm <- DocumentTermMatrix(corpus2, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm <- removeSparseTerms(dtm, sparse = 0.99) # Adjust sparsity threshold as needed
Text Analysis:
The word cloud effectively illustrates which topics are most engaging
within the parameter for at least 20 likes or retweets. This
visualization can help in refining the communication and engagement
strategies by focusing on the topics that naturally engage your
audience.
- "forscherteam" (research team) and "entwickelt" (developed): suggest a strong emphasis on research and development topics.
- "lab": indicates discussions possibly related to laboratory work or scientific studies.
- "data" and "digital": reflect a focus on digital technologies and data science, crucial in contemporary research and education.
- "open": could relate to open source, open access, or openness in research and education, pointing towards transparency and accessibility in academic resources.
- "nein" (no) and "wieso" (why): might indicate debates or discussions, possibly questioning certain methods or findings.
- "schweizer" (Swiss): identifies the national or cultural context, implying that the content is likely relevant to or originating from Swiss institutions or discussing Swiss innovations.
- "gespräch" (conversation): underscores the interactive or dialogical nature of the tweets, suggesting that engagement may be driven by conversational or discursive posts.
- Not well cleaned elements: The presence of strings like "http" might be artifacts from URLs or specific hashtags, which although not directly meaningful, indicate the inclusion of links or specific calls to action in the tweets.
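The "http" leftovers arise because removePunctuation runs before the URL regex: once ':' and '/' are stripped, the pattern http[s]?://\S+ no longer matches and fragments like "httpstco..." survive. A sketch of a reordered cleaning step that strips URLs first (a suggested alternative, not the pipeline used above; the function name is illustrative):

```r
# Illustrative reordering: remove URLs while "://" is still intact,
# then strip punctuation and collapse whitespace
clean_urls_first <- function(x) {
  x <- gsub("https?://\\S+", "", x)   # URLs first
  x <- gsub("[[:punct:]]", " ", x)    # then punctuation
  gsub("\\s+", " ", trimws(x))        # collapse whitespace
}

clean_urls_first("Neues Projekt! https://t.co/abc123")  # "Neues Projekt"
```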
set.seed(123)
# Compute term frequencies (columns of the DTM are terms, rows are documents)
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
top_word_freq <- head(word_freq, 80)
# Generate word cloud with terms paired to their own frequencies
wordcloud(
words = names(top_word_freq),
freq = top_word_freq,
max.words = 80,
scale = c(4, 0.5), # Control for size of the most and least frequent words
random.order = FALSE, # Higher frequency words appear first
rot.per = 0.25, # Allows some rotation for fitting
colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
High Engagement - Emojis
- Utility and Guidance: Directional emojis like ➡️ suggest that providing clear guidance or calls to action within tweets is effective in garnering engagement.
- Cultural and International Appeal: The presence of multiple national flags suggests that tweets connected to specific national contexts or international discussions resonate with the audience.
- Emotional and Informative Content: Emojis like ✨ (sparkles) and heart emojis are often used to add emotional depth or positivity to tweets. Similarly, 📅 (calendar) and 📢 (megaphone) likely denote event-related or important announcements that command attention.
# Analyze the frequency of different emojis
emoji_freq1 <- table(unlist(high_engagement_tweets$emojis))
sort(emoji_freq1, decreasing = TRUE)
##
## ➡️ 🇨🇭 ⤵️ ✨ 🇨🇳 🇬🇧 🇳🇱 🇸🇪 🇸🇬 π π π π’ ποΈ π π π π¨
## 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
High Engagement - Hours
This graph shows tweet engagement by hour, indicating that the peak time for high engagement tweets occurs at 16:00 (4 PM). Engagement appears to be generally higher in the afternoon hours compared to the morning and late evening.
# Analyze and plot tweet counts by hour to find the best posting times
best_posting_hours1 <- high_engagement_tweets %>%
group_by(timeofday_hour) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Plotting
ggplot(best_posting_hours1, aes(x = timeofday_hour, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Tweet Engagement by Hour",
x = "Hour of the Day",
y = "Number of High Engagement Tweets") +
theme_minimal()
High Engagement - Days
It shows that Monday and Tuesday are the days with the highest engagement, indicating these might be optimal days for posting to maximize visibility and interaction. The engagement noticeably declines as the week progresses, with the lowest engagement occurring over the weekend, suggesting less audience activity during these days.
# Extract the day of the week from 'tweet_date'
high_engagement_tweets <- high_engagement_tweets %>%
mutate(day_of_week = wday(tweet_date, label = TRUE, week_start = 1)) # Adjust 'week_start' if your week starts on a different day
# Analyze and plot tweet counts by day of the week
best_posting_days1 <- high_engagement_tweets %>%
group_by(day_of_week) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Plotting
ggplot(best_posting_days1, aes(x = day_of_week, y = count)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Tweet Engagement by Day of the Week",
x = "Day of the Week",
y = "Number of High Engagement Tweets") +
theme_minimal()
Engagement Analysis by University
The bar chart visualizes the total likes accumulated by different
universities within the parameter for at least 20 likes or retweets,
highlighting variations in engagement across these institutions on
social media.
The visualization clearly shows which universities receive the most engagement in terms of likes. HSLU (Lucerne University of Applied Sciences and Arts) and ZHAW (Zurich University of Applied Sciences) stand out with the highest engagement, significantly more than other institutions. Institutions like HSLU and ZHAW can thus serve as benchmarks, offering a pathway for others to refine their social media tactics.
# Analysis of likes and retweets
high_engagement_tweets %>%
group_by(university) %>%
summarize(total_likes = sum(favorite_count), total_retweets = sum(retweet_count), .groups = 'drop') %>%
ggplot(aes(x = reorder(university, total_likes), y = total_likes)) +
geom_col(fill = "steelblue") + # geom_col already uses stat = "identity"
coord_flip() +
labs(title = "Engagement Analysis by University", x = "University", y = "Total Likes")
HSLU & ZHAW Engagement Analysis
In this area, we will analyze the universities HSLU (Lucerne University of Applied Sciences and Arts) and ZHAW (Zurich University of Applied Sciences) to find out why they have significantly more interactions compared to other universities.
HSLU & ZHAW Engagement - Terms
Text Preprocessing:
For this, we must again take text preprocessing measures, as in the previous analyses.
#Filter Tweets for HSLU and ZHAW
hslu_zhaw_tweets <- filtered_tweets %>%
filter(university %in% c("hslu", "ZHAW"))
# Set a threshold for "high engagement" (e.g., tweets with at least 20 likes or retweets)
engagement_threshold1 <- 20
# Filter tweets based on this engagement threshold
hslu_zhaw_high_engagement_tweets <- hslu_zhaw_tweets %>%
filter(favorite_count >= engagement_threshold1 | retweet_count >= engagement_threshold1)
# Rebuild the corpus with the sampled data
corpus3 <- Corpus(VectorSource(hslu_zhaw_high_engagement_tweets$clean_text))
corpus3 <- tm_map(corpus3, content_transformer(tolower)) # Convert to lower case
corpus3 <- tm_map(corpus3, removePunctuation) # Removing punctuation marks
corpus3 <- tm_map(corpus3, removeNumbers) # Removing numbers
corpus3 <- tm_map(corpus3, removeWords, stopwords("german")) # Removing stop words
corpus3 <- tm_map(corpus3, removeWords, stopwords("french"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("italian"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("english"))
corpus3 <- tm_map(corpus3, stripWhitespace) # Removal of additional spaces
corpus3 <- tm_map(corpus3, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus3 <- tm_map(corpus3, content_transformer(function(x) {
x <- gsub("’", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove the 'amp' token from HTML-encoded '&amp;' (word boundaries protect words like 'campus')
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm2 <- DocumentTermMatrix(corpus3, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm2 <- removeSparseTerms(dtm2, sparse = 0.99) # Adjust sparsity threshold as needed
Text Analysis:
The word cloud illustrates which topics are most engaging within the parameter of at least 20 likes or retweets.
- Environmental Focus: Terms like "Klimaziel" suggest discussions around climate goals, indicating a strong environmental or sustainability focus within the discourse.
- Economic Impact: Words such as "profitieren" and "Beitrag" highlight discussions on economic benefits and contributions, potentially related to how environmental goals can align with economic gains.
- Educational Context: The presence of "ZHAW" directly ties the content to the Zurich University of Applied Sciences, suggesting these topics are relevant to university-led discussions or initiatives.
- National Relevance: The inclusion of "Schweiz" ties the discussions to Switzerland, indicating that these topics are of national interest, potentially discussing Swiss policies or initiatives regarding sustainability.
- Not well cleaned elements: The presence of strings like "httpstcoezfrwxu" might be artifacts from URLs, which although not directly meaningful, indicate the inclusion of links or specific calls to action in the tweets.
set.seed(123)
# Compute term frequencies (columns of the DTM are terms, rows are documents)
word_freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
top_word_freq2 <- head(word_freq2, 80)
# Generate word cloud with terms paired to their own frequencies
wordcloud(
words = names(top_word_freq2),
freq = top_word_freq2,
max.words = 80,
scale = c(4, 0.5), # Control for size of the most and least frequent words
random.order = FALSE, # Higher frequency words appear first
rot.per = 0.25, # Allows some rotation for fitting
colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
HSLU & ZHAW Engagement - Emojis
- Direction: The use of the right arrow (➡️) or right-pointing finger (👉) suggests a focus on direction or continuation, potentially indicating links or further content.
- Positive Emojis: The inclusion of positive emojis like the yellow heart (💛) or smiley face (😊) indicates a friendly, positive communication style.
- Local Topics: The Swiss flag (🇨🇭) might be used in contexts relating to national pride or local topics. Overall, these emojis contribute to engaging and positive social media interactions, which could be part of why these universities have higher engagement rates.
# Analyze the frequency of different emojis
emoji_freq <- table(unlist(hslu_zhaw_high_engagement_tweets$emojis))
sort(emoji_freq, decreasing = TRUE)
##
## β‘οΈ π¨π π π π π
## 2 1 1 1 1 1
HSLU & ZHAW Engagement - Hours
The early morning (8 AM) and late afternoon to early evening (17:00 and 18:00) are the most effective times to post content that is likely to get high engagement. This suggests that timing posts to align with these peak periods could enhance visibility and interaction for the universities' social media content.
# Analyze and plot tweet counts by hour to find the best posting times
best_posting_hours <- hslu_zhaw_high_engagement_tweets %>%
group_by(timeofday_hour) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Plotting
ggplot(best_posting_hours, aes(x = timeofday_hour, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Tweet Engagement by Hour for HSLU and ZHAW",
x = "Hour of the Day",
y = "Number of High Engagement Tweets") +
theme_minimal()
HSLU & ZHAW Engagement - Days
This graph illustrates the distribution of high engagement tweets for HSLU and ZHAW by day of the week, showing that Tuesday is the most effective day to post on Twitter for maximizing engagement at these universities. The sharp drop in engagement over the weekend further supports the trend that weekdays, particularly the beginning of the week, are optimal for reaching the audience.
# Extract the day of the week from 'tweet_date'
hslu_zhaw_high_engagement_tweets <- hslu_zhaw_high_engagement_tweets %>%
mutate(day_of_week = wday(tweet_date, label = TRUE, week_start = 1)) # Adjust 'week_start' if your week starts on a different day
# Analyze and plot tweet counts by day of the week
best_posting_days <- hslu_zhaw_high_engagement_tweets %>%
group_by(day_of_week) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Plotting
ggplot(best_posting_days, aes(x = day_of_week, y = count)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Tweet Engagement by Day of the Week for HSLU and ZHAW",
x = "Day of the Week",
y = "Number of High Engagement Tweets") +
theme_minimal()
BFH Frequency Analysis
In this section, we will analyze BFH (Bern University of Applied Sciences) to find out which content they tweet about most often.
Text Preprocessing:
For this we must again take text preprocessing measures, as in the previous analyses.
# Filter tweets for BFH
bfh_tweets <- filtered_tweets %>%
filter(university %in% "bfh")
# Rebuild the corpus with the sampled data
corpus4 <- Corpus(VectorSource(bfh_tweets$clean_text))
corpus4 <- tm_map(corpus4, content_transformer(tolower)) # Convert to lower case
corpus4 <- tm_map(corpus4, removePunctuation) # Removing punctuation marks
corpus4 <- tm_map(corpus4, removeNumbers) # Removing numbers
corpus4 <- tm_map(corpus4, removeWords, stopwords("german")) # Removing stop words
corpus4 <- tm_map(corpus4, removeWords, stopwords("french"))
corpus4 <- tm_map(corpus4, removeWords, stopwords("italian"))
corpus4 <- tm_map(corpus4, removeWords, stopwords("english"))
corpus4 <- tm_map(corpus4, stripWhitespace) # Removal of additional spaces
corpus4 <- tm_map(corpus4, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus4 <- tm_map(corpus4, content_transformer(function(x) {
x <- gsub("’", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove the 'amp' token from HTML-encoded '&amp;' (word boundaries protect words like 'campus')
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm3 <- DocumentTermMatrix(corpus4, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm3 <- removeSparseTerms(dtm3, sparse = 0.99) # Adjust the sparsity threshold as needed
Text Analysis:
The word cloud illustrates which topics have the highest frequency.
- Practical and Innovative Focus: Terms like “Praxis” (practice) and “Innov” (innovation) indicate a strong link between academic content and real-world applications, appealing particularly to an audience interested in actionable and cutting-edge information.
- Community and Collaboration: Words such as “zusammen” (together) and “unsere” (our) reflect a community-focused approach, promoting collective efforts and teamwork within the university setting.
- Local Identity and Quality: The mention of “Schweizer” (Swiss) suggests content with a national focus, likely resonating with local pride, while “Qualität” (quality) underscores the university’s commitment to high standards in education and research.
set.seed(123)
# Term frequencies: in a DocumentTermMatrix, columns are terms, so use colSums()
word_freq3 <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
top_word_freq3 <- head(word_freq3, 80)
# Generate the word cloud; names(top_word_freq3) keeps each word aligned with its frequency
wordcloud(
  words = names(top_word_freq3),
  freq = top_word_freq3,
  max.words = 80,
  scale = c(4, 0.5), # Control the size of the most and least frequent words
  random.order = FALSE, # Higher-frequency words appear first
  rot.per = 0.25, # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
Most Frequent Emojis:
- Technology and Innovation: Directional emojis and devices such as the laptop 💻 dominate, highlighting content on technological advancements and future trends.
- Environmental Themes: Nature-related emojis (🌴, 🌲, 🌳, ♻️) emphasize environmental issues and sustainability efforts.
- Community and Celebrations: Celebration emojis are used for achievements and milestones, fostering community spirit.
- Health and Lifestyle: Food emojis like 🥥 and 🥦 suggest a focus on health and nutrition.
- Global and Cultural Awareness: Globe symbols, together with the Swiss flag 🇨🇭, point to global awareness and local identity.
# Analyze the frequency of different emojis
emoji_freq3 <- table(unlist(bfh_tweets$emojis))
sort(emoji_freq3, decreasing = TRUE)[1:30]
## (console output: the 30 most frequent emojis with their counts, ranging from 49
## down to 5; recoverable glyphs include 💻, 🌴, 🌲, 🌳, ♻️, 🇨🇭, 🥥 and 🥦. The
## remaining emoji characters were garbled during export.)
BFH Frequency - Hours
The highest volume of tweets is sent between 08:00 and 09:00, with a notable peak also around 06:00. There’s a gradual decline in tweet activity as the day progresses, especially after 17:00, indicating lower activity in the evening.
# Analyze and plot tweet counts by hour to find the best posting times
best_posting_hours2 <- bfh_tweets %>%
group_by(timeofday_hour) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Plotting
ggplot(best_posting_hours2, aes(x = timeofday_hour, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Tweet Frequency by Hour for BFH",
x = "Hour of the Day",
       y = "Number of Tweets") +
  theme_minimal()
BFH Frequency - Days
The data shows a consistently high level of tweeting activity from Monday through Friday, with the peak on Wednesday, followed by a sharp decline during the weekend.
# Extract the day of the week from 'tweet_date'
bfh_tweets <- bfh_tweets %>%
mutate(day_of_week = wday(tweet_date, label = TRUE, week_start = 1)) # Adjust 'week_start' if your week starts on a different day
# Analyze and plot tweet counts by day of the week
best_posting_days2 <- bfh_tweets %>%
group_by(day_of_week) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Plotting
ggplot(best_posting_days2, aes(x = day_of_week, y = count)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Tweet Frequency by Day of the Week for BFH",
x = "Day of the Week",
       y = "Number of Tweets") +
  theme_minimal()
Recommendations
In this section, we outline the most important recommendations for BFH derived from the preceding analyses, so that it can use these measures to increase interactions on its Twitter account.
Optimal Terms
Focus on digital and innovative topics: Words like “digital”, “digit”, “data”, and “open” show that topics around digitalization and open-data initiatives attract a lot of interest.
Emphasize sustainability and climate targets: Terms like “climate goal” and “sustain” show that discussions around sustainability and environmental responsibility resonate; BFH already covers sustainability in part.
Use interactive elements: Direct address, for example inviting a “Gespräch” (conversation), can help increase interactivity and community engagement.
Expand and include the target audience: Words like “chunivers” (CH universities) suggest that broader discussions of topics affecting multiple universities resonate.
BFH should take up these topics more often in order to generate more interactions on its account.
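As a rough sanity check of this recommendation, the underlying comparison can be sketched in base R. The column names (`clean_text`, `retweet_count`, `favorite_count`) mirror those used elsewhere in this report, but the data frame below is a small illustrative sample, not the real dataset.

```r
# Minimal sketch: do tweets mentioning the recommended terms earn higher
# average engagement? Illustrative sample data only.
tweets <- data.frame(
  clean_text     = c("digital data initiative", "campus event today",
                     "open data and sustain goals", "new semester starts"),
  retweet_count  = c(5, 0, 7, 1),
  favorite_count = c(10, 1, 12, 2)
)
terms   <- c("digital", "digit", "data", "open", "sustain")
pattern <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b")
has_term   <- grepl(pattern, tweets$clean_text, ignore.case = TRUE, perl = TRUE)
engagement <- tweets$retweet_count + tweets$favorite_count
mean(engagement[has_term])   # mean engagement of tweets with recommended terms
mean(engagement[!has_term])  # mean engagement of the remaining tweets
```

On the real data, the same pattern match against `clean_text` would show whether the recommended vocabulary actually coincides with higher engagement.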
Optimal Emojis
Emojis that appeared most often in high-interaction tweets at other universities, such as the Swiss flag 🇨🇭, should be used more. BFH’s own most frequent emoji is already widely used and should continue to be used.
Optimal Posting Times
Analysis shows that tweets generate the most engagement in the early mornings and late afternoons. It would be advisable to schedule important announcements and content during these times.
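To make “optimal times” concrete, one option is to rank hours by average engagement per tweet rather than by tweet volume. A minimal base-R sketch, reusing the `timeofday_hour` naming from above but with a small made-up sample:

```r
# Illustrative sample only; engagement = retweets + favorites per tweet
sample_tweets <- data.frame(
  timeofday_hour = c(6, 6, 8, 8, 8, 17, 17, 22),
  engagement     = c(4, 6, 2, 3, 1, 8, 10, 0)
)
# Mean engagement per hour, highest first
mean_by_hour <- tapply(sample_tweets$engagement, sample_tweets$timeofday_hour, mean)
best_hours   <- sort(mean_by_hour, decreasing = TRUE)
best_hours
```

Ranking by the mean rather than the count avoids recommending hours that are merely busy but not engaging.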
Optimal Posting Weekday
Engagement analysis by day of the week shows that engagement from Monday to Thursday is generally higher than at the weekend. Tuesday in particular is the day that receives the most interactions. BFH should consider focusing its main communication on these days, and especially on Tuesday.
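The day and hour findings can be combined into a single ranked table of posting slots. The slot scores below are placeholders for illustration; in practice they would be computed from the real engagement data.

```r
# Hypothetical mean-engagement scores per (day, hour) slot, for illustration only
slots <- expand.grid(day = c("Mon", "Tue", "Wed"), hour = c(8, 17),
                     stringsAsFactors = FALSE)
slots$mean_engagement <- c(3.1, 5.4, 4.2, 2.0, 4.8, 3.3)
# Rank slots by score, best first
ranked <- slots[order(-slots$mean_engagement), ]
head(ranked, 1)  # best slot in this toy example
```

A table like this would let BFH schedule its most important announcements into the top-ranked day-and-hour combinations.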
Conclusion
Our findings reveal that high engagement for tweets correlates
strongly with content focused on digital transformation, sustainability,
and institutional activities. Notably, tweets featuring themes of
innovation, open data, and environmental initiatives received more
engagement, reflecting a broader interest in these areas among the
audience.
Furthermore, the timing of posts plays a crucial role in
maximizing visibility and interaction. The data indicated specific hours
and days where engagement peaked, suggesting optimal times for posting
to ensure maximum reach.
By implementing these recommendations, BFH
can increase engagement on its social media platforms and potentially
reinforce its position as a forward-thinking and impactful educational
institution in Switzerland.